Evolution repeats itself in replicate long-term studies in the wild

The extent to which evolution is repeatable remains debated. Here, we study changes over time in the frequency of cryptic color-pattern morphs in 10 replicate long-term field studies of a stick insect, each spanning at least a decade (across 30 years of total data). We find predictable “up-and-down” fluctuations in stripe frequency in all populations, representing repeatable evolutionary dynamics based on standing genetic variation. A field experiment demonstrates that these fluctuations involve negative frequency-dependent natural selection (NFDS). These fluctuations rely on demographic and selective variability that pushes populations away from equilibrium, such that they can reliably move back toward it via NFDS. Last, we show that the origin of new cryptic forms is associated with multiple structural genomic variants such that which mutations arise affects evolution at larger temporal scales. Thus, evolution from existing variation is predictable and repeatable, but mutation adds complexity even for traits evolving deterministically under natural selection.


Hierarchical model for long-term field studies
We used data on morph frequencies in T. cristinae from 1990 to 2023 (32,867 stick insect observations, with 213 population by year combinations on Adenostoma and 206 population by year combinations on Ceanothus) to estimate the average frequency of striped morphs on Adenostoma and Ceanothus host plants each year of the time series.We did this using a hierarchical Bayesian generalized linear model.We assumed a binomial likelihood for the number of striped stick insects for each population and year, with a population (j) and year (i) specific probability of stripe (p ij ) and sample size (n ij ).Melanistic stick insects were excluded from this analysis.Thus, our estimate of stripe frequency is out of the total number of striped and green stick insects.We then defined a linear model for the logit frequency of stripe for each Adenostoma and Ceanothus population for each year as: Here, α and β are population and year specific effects, where population refers to a specific location and host plant.We placed separate normal priors on the α parameters for Adenostoma and Ceanothus populations, with means (µ A and µ C ) and precisions (i.e., reciprocal of the variance, τ A and τ C ) estimated from the data.Separate normal priors were also placed on the year effects for Adenostoma and Ceanothus with means set to 0 and precisions estimated from the data.Sum-to-zero constraints were imposed on the year effects to ensure model identifiability.We then placed mostly uninformative priors on these higher level means and precisions, µ ∼ normal(mean = 0, precision = 1e −6 ) and τ ∼ gamma(shape = 0.01, rate = 0.001).
Given the data sample size, these priors contain a trivial amount of information, e.g., the gamma priors denote a prior sample size of 2 * shape or 0.02 observations compared to the the 32,867 data observations, and thus the priors have a negligible effect on the posterior distributions.
We fit this model via Markov chain Monte Carlo (MCMC) using the rjags (version 4. 14) interface with JAGS (98).We ran three chains each with a 10,000 iteration burn-in followed by 20,000 iterations for sampling with a thinning interval of 5. We verified likely convergence of the MCMC algorithm to the posterior distribution by computing Gelman and Rubin's potential scale reduction factor using the gelman.diagfunction from coda (version 0.19.4) (99).The upper bound of the 95% confidence interval for the potential scale reduction factor was less than 1.01 for all model parameters, consistent with convergence.Effective sample sizes for model parameters ranged from 495 to 12,696 (median = 9011, mean = 8257).This analysis was done using R version 4.2.2.

Hierarchical model for estimating D from the time-series data
We characterized NFDS dynamics and expected outcomes from our stripe-frequency time series by estimating D, defined as D = ∂∆p ∂p p=p .We specifically estimated D for each of the 10 T. cristinae locations with 10 or more pairs of consecutive years (i.e., pairs of p and ∆p) (see Table S1 for details about the 10 locations).We did this using a hierarchical Bayesian model.
We accounted for uncertainty in stripe frequency in each location (j) and year (i) by assuming the observed stripe count was binomially distributed, such that y ij ∼ binomial(p ij , n ij ), where p ij is the true (unobserved) stripe frequency and n ij is the sample size.We then defined the following linear model: Here, α j is a location-specific intercept, D j is the location-specific slope parameter of interest (D), and ϵ ij is an error turn.We placed hierarchical priors on intercept and slope parameters, The hyperparameters were given the following weakly informative priors and inferred from the data: µ α ∼ normal(mean = 0, SD = 20), µ D ∼ normal(mean = 0, SD = 20), σ α ∼ gamma(2, 0.1), and σ D ∼ gamma(2, 0.1).We assumed a normal prior on the error terms with a mean of 0 and a standard deviation estimated from the data; the latter was given the following prior, σ ∼ gamma(2, 0.1).We fit this model using Hamiltonian Monte Carlo (HMC) (100) via the rstan (version 2.21.8) interface with Stan (101).We used the No-U-Turn (NUTS) sampler (102) and ran four HMC chains each comprising a 1000 iteration warmup and 1000 additional sampling iterations.We computed and examined the Gelman-Rubin convergence diagnostic (i.e., the potential scale reduction factor) to verify adequate HMC mixing and likely convergence of the HMC algorithm to the posterior distribution (all estimates were less than 1.01).Effective sample sizes for model parameters ranged from 410 to 7258 (median = 3941, mean = 3787).We repeated the analysis using less informative priors to assess the sensitivity of the results to the priors selected.Specifically, we increased the standard deviations for the normal priors to 100 (versus 20) and changed the gamma shape parameters to 1 (instead of 2).This resulted in poorer mixing (lower effective sample sizes) but yielded very similar estimates of D (Pearson correlation for D between alternative model specifications = 0.99, 95% confidence interval = 0.96 to 1.00).
We assessed repeatability of evolutionary dynamics and outcomes across the 10 locations in terms of whether estimates of D were all consistent with NFDS (D < 0) and suggestive of a similar outcome (e.g., all −1 < D < 0 indicative of gradual convergence to an equilibrium), and based on the variability (e.g., standard deviation) of D across locations (a smaller standard deviation implies greater repeatability, see, e.g., (103)).

NFDS theory
Analytical theory for NFDS presented in the main text (Figure 5) is adapted from (104) .We assumed the relative fitnesses of A 1 A 1 homozygotes, A 1 A 2 heterozygotes, and A 2 A 2 homozygotes depended on the genotype frequencies as follows: , and w 22 = 1 − s b γ 12 + s b γ 11 (for A 2 A 2 ).Here, s denotes the effect of heterozygote frequency on the fitness of heterozygotes, s b captures frequency-dependence for homozygotes (57), and the γ terms denote the genotype frequencies.Genotype frequencies after selection were calculated as γ ′ ij = γ ij w ij / w, where w is the mean fitness.Allele frequencies were calculated from genotype frequencies.Results in Figure 5  Model fitting for the field experiment testing the form of the NFDS function We fit a Bayesian model to test for NFDS and quantify the relationship between stripe frequency and fitness in the T. cristinae NFDS experiment.For this, we assumed that the number of green and striped stick insects recaptured from an experimental treatment (i.e., a single bush with a unique initial stripe frequency) was binomially distributed with a morph and treatment-specific survival probability (i.e., absolute fitness), y green ).Here, i denotes the treatment index (i.e., the index for the initial stripe frequency) and w green i and w striped i denote the absolute fitnesses of green and striped morphs in treatment i.We then defined a pair of linear equations that specify the absolute fitness and relative fitness of the stripe morph as a function of the initial stripe frequency, In these equations, the µ terms serve as intercepts and the β terms capture the effects of initial stripe frequency on average survival probability (average of the two absolute fitnesses) and relative fitness of the stripe, here . From these equations, it follows that the absolute fitness of the green morph is w green i = 2 wi − w striped i .We placed weakly informative normal priors on the µ and β terms, all normal(mean = 0, SD = 10).We fit this model using HMC via the rstan (version 2.21.8) interface with Stan (101).For this, we used four HMC chains with the NUTS sampler, each comprising 4000 warmup iterations followed by 4000 sampling iterations, and set the target average proposal acceptance probability (adapt delta) to 0.9 (the default value is 0.8).We computed and examined the Gelman-Rubin convergence diagnostic to verify adequate HMC mixing and likely convergence of the HMC algorithm to the posterior distribution (all estimates of the potential scale reduction factor were ≤1.01).Effective sample sizes for core model parameters (i.e., not transformed or derived parameters) ranged from 717 to 3299 (median = 1208, mean = 1479).
We next compared this model to an alternative model allowing for a non-linear relationship between stripe frequency and fitness.The model was identical to the linear model described above, except a sigmoid function was used to model relative fitness, Here, L denotes the maximum relative fitness, k specifies the steepness of the curve and c is the stripe frequency at which the relative fitness is halfway between its minimum and maximum value.Depending on the parameter values, this function can approximate a step function (very steep curves) or a near-linear function (very shallow curves).We placed weakly informative normal priors on L and k, each with a mean of 0 and SD of 10, and an uninformative beta prior on c (both shape parameters set to 1).The same HMC specifications were used for this second model and again the Gelman-Rubin diagnostic indicated likely convergence to the posterior (i.e., all estimates of the potential scale reduction factor were ≤1.01).
We used approximate leave-one-out cross validation to compare this model to the linear model described above.This was done with the loo function from rstan (105).As shown in Table S3 and Figure S4, the two models performed similarly with the linear model marginally outperforming the sigmoid model.However, even more importantly, both models produced in near linear relationships between stripe frequency and fitness.In other words, the best sigmoidal model approximated a linear model not a step model and is quite similar to the linear model despite the different prior structure specified by the model.
We fit an additional model to estimate D in a manner analogous to the time-series data.
This second Bayesian model was identical to that described above for estimating D from the time-series data, except that a single D was inferred (rather than one per locality) and thus the model was not hierarchical.As such, we used weakly informative priors on α (the intercept) and D, both normal(mean = 0, SD = 1).We likewise placed a weakly informative prior on the residual standard deviation, σ ∼ gamma(2, 0.1).Here, we again used four HMC chains with the NUTS sampler; in this case, each chain comprised a 3000 iteration warmup followed by 20,000 iterations for sampling and adapt delta was set to 0.95.Once again, we computed and examined the Gelman-Rubin convergence diagnostic to verify adequate HMC mixing and likely convergence of the HMC algorithm to the posterior distribution.All estimates of the potential scale reduction factor were 1.00, consistent with convergence; effective sample sizes for core model parameters ranged from 155 to 9437 (median = 719, mean = 1407).We further verified that moderate changes to the specified priors did not unduly affect our inference.Specifically, we increased the normal standard deviation parameters from one to five and decreased the gamma shape parameters from two to one (both specifying less constraining/informative priors) and obtained nearly identical estimates of D (0.76 versus 0.77) (this was despite a decrease in mixing indicated by reduced effective sample sizes with the alternative, more permissive priors).
The equilibrium stripe frequency for the experimental location was predicted from these results (with the original prior specifications) as p = −α D .All analyses described in this section were conducted using R version 4.2.2.

Hierarchical Bayesian model testing for dampened oscillations
We fit a hierarchical Bayesian model for stripe frequency change (∆p ij ) as a function of time (year, which is equivalent to generation) using only the consistently sampled years from 2000 onward for this analysis.The hypothesis of dampening oscillations would predict a negative effect of year on the absolute value of stripe frequency change.We assumed a normal likelihood for stripe frequency change, specifically ∆p ij ∼ normal(µ ij , σ), with a linear model for the expected change as a function of time (i) and locality (j) In this equation, α j is a location-specific intercept and β j denotes the location-specific effect of year on the absolute value of change.We placed hierarchical priors on the regression parameters, α j ∼ normal(µ α , σ α ) and β j ∼ normal(µ β , σ β ).We then placed weakly informative priors on the hyperparameters and estimated these from the data: µ α ∼ normal(mean = 0, SD = 1), µ β ∼ normal(mean = 0, SD = 1), σ α ∼ gamma(2, 0.1), and σ β ∼ gamma(2, 0.1).
We likewise placed a weakly informative prior on the residual standard deviation term, σ ∼ gamma(2, 0.1).We fit this model using HMC (100)  A Chicago library was prepared as described in (106).Briefly, ∼500ng of high molecular weight gDNA was reconstituted into chromatin in vitro and fixed with formaldehyde.Fixed chromatin was digested with DpnII, the 5' overhangs filled in with biotinylated nucleotides, and then free blunt ends were ligated.After ligation, crosslinks were reversed and the DNA purified from protein.Purified DNA was treated to remove biotin that was not internal to ligated fragments.The DNA was then sheared to ∼350 bp mean fragment size and sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters.
Biotin-containing fragments were isolated using streptavidin beads before PCR enrichment of each library.The libraries were sequenced on an Illumina HiSeq X to a target depth of 30X coverage.
A Dovetail Hi-C library was prepared in a similar manner as described previously (107).
Briefly, for each library, chromatin was fixed in place with formaldehyde in the nucleus and then extracted.Fixed chromatin was digested with DpnII, the 5' overhangs filled in with biotinylated nucleotides, and then free blunt ends were ligated.After ligation, crosslinks were reversed and the DNA purified from protein.Purified DNA was treated to remove biotin that was not internal to ligated fragments.The DNA was then sheared to ∼350 bp mean fragment size and sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters.
Biotin-containing fragments were isolated using streptavidin beads before PCR enrichment of each library.The libraries were sequenced on an Illumina HiSeq X to a target depth of 30X coverage.
The PacBio de novo assembly, Chicago library reads, and Dovetail Hi-C library reads were used as input data for HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies ( 106).An iterative analysis was conducted.First, Chicago library sequences were aligned to the draft input assembly using a modified SNAP read mapper (http://snap.cs.berkeley.edu).The separations of Chicago read pairs mapped within draft scaffolds were analyzed by HiRise to produce a likelihood model for genomic distance between read pairs, and the model was used to identify and break putative misjoins, to score prospective joins, and make joins above a threshold.After aligning and scaffolding Chicago data, Dovetail Hi-C library sequences were aligned and scaffolded following the same method.

Methods for delimiting the bounds of the color-pattern loci
As noted in the main text, we used local PCAs in combination with the GWA signal to delineate two large genetic regions (loci) associated with color pattern variation, denoted Pattern and Mel-Stripe2.We first performed separate (local) PCA ordinations of the genetic data (centered but not standardized genotype matrixes) for the 5890 SNPs on chromosome 8 in 100-SNP sliding windows (5791 windows total).We then summarized each PCA by the eigenvalue associated with the first eigenvector.Visual inspection of sliding window eigenvalues suggested two peaks of high eigenvalues (accentuated structure) on chromosome 8 coinciding with the GWA signal.
We formally delimited the color-pattern loci using a Hidden Markov model (HMM) approach.Specifically, we fit a HMM to the eigenvalues using the HiddenMarkov package (version 1.8.13) in R (108).We modeled two hidden states, which we initialized with expected values equal to the 25th and 75th percentiles of the empirical eigenvalue distribution for chromosome 8.We assumed a normal distribution for the observed eigenvalues with standard deviations initialized at half the empirical standard deviation.We then estimated the hidden state means, standard deviations, and transitions between hidden states using the Baum-Welch algorithm (i.e., we set initial values for means and standard deviations but these were then refined with the Baum-Welch algorithm) (109).For this, we allowed a maximum of 500 iterations and set the tolerance to 1e −4 .Using this procedure, we identified states corresponding to high (mean eigenvalue = 2.67, SD = 0.43) and low or background (mean eigenvalue = 1.87,SD = 0.14) population structure (i.e., high and low eigenvalues).We then used the Viterbi algorithm for decoding, that is for inferring the most likely hidden state for each 100 SNP window (110).

Hierarchical Bayesian model testing the selective advantage of the striped morph on Adenostoma
We fit a hierarchical Bayesian generalized linear model to determine whether striped T. cristinae were more fit than melanic morphs on Adenostoma.We assumed a binomial likelihood for the number of striped stick insects recaptured for each block (j), y j ∼ binomial(p j , n j ).
Here, p j and n j denote the block-specific probability that a recaptured stick insect was striped and the block-specific number of stick insects recaptured.We assumed logit(p j ) = β + α j , where β denotes the mean effect of being striped (relative to melanic) and α is a replicatespecific deviation from this mean.We modeled the α hierarchically with a zero-centered normal prior, α j ∼ normal(mean = 0, SD = σ), and placed weakly informative priors on β and σ, β ∼ normal(mean = 0, SD = 10) and σ ∼ normal(mean = 0, SD = 1).We again fit this model with HMC and the NUTS sampled via the rstan interface with Stan.We used four HMC chains, each comprising a 1000 iteration warmup followed by an additional 1000 sampling iterations.We again computed and examined the Gelman-Rubin convergence diagnostic to verify adequate HMC mixing and likely convergence of the HMC algorithm to the posterior distribution (all values of the potential scale reduction factor were ≤1.01, consistent with convergence); effective sample sizes for core model parameters ranged from 964 to 8661 (median = 3974, mean = 4038).We then focused on the posterior distribution of β, the stripe effect, for inference.This analysis was conducted with R version 4.2.2.Using more permissive (less constraining or informative) priors again had little effect on inference.Specifically, estimates of β were very similar (posteriors medians of 1.49 versus 1.58 for the original versus less constraining priors) and estimates of the replicate effects (α's) were highly correlated (Pearson correlation = 0.998, 95% confidence intervals = 0.989 to 1.00).

Reference genome DNA sequencing and assembly for Timema podura
As noted in the main text, we created a de novo reference genome for T. podura using a combination of PacBio and Illumina reads from a Omni-C genomic library (all performed by Dovetail Genomics).The DNA sample from a single female T. podura was quantified using Qubit 2.0 Fluorometer (Life Technologies, Carlsbad, CA, USA).The PacBio SMRTbell library (∼20kb) for PacBio Sequel was constructed using SMRTbell Express Template Prep Kit 2.0 (PacBio, Menlo Park, CA, USA) using the manufacturer recommended protocol.The library was bound to polymerase using the Sequel II Binding Kit 2.0 (PacBio) and loaded onto PacBio Sequel II).
Sequencing was performed on PacBio Sequel II 8M SMRT cells.
wtdbg2 (111) was then run to generate a primary assembly.Next, blobtools v1.1.1 ( 112) was used to identify potential contamination in the assembly based on blast (v2.9) results of the assembly against the NT database.A fraction of the scaffolds was identified as contaminant and were removed from the assembly.The filtered assembly was then used as an input to purge dups v1.1.2(113), and potential haplotypic duplications were removed from the assembly, resulting in the final assembly.
For the Dovetail Omni-C library, chromatin was fixed in place with formaldehyde in the nucleus.Fixed chromatin was digested with DNaseI and then extracted, chromatin ends were repaired and ligated to a biotinylated bridge adapter followed by proximity ligation of adapter containing ends.After proximity ligation, crosslinks were reversed and the DNA purified.Purified DNA was treated to remove biotin that was not internal to ligated fragments.Sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters.
Biotin-containing fragments were isolated using streptavidin beads before PCR enrichment of each library.The library was sequenced on an Illumina HiSeqX platform to produce ∼30X sequence coverage.
The input PacBio de novo assembly and Dovetail Omni-C library reads were used as input data for HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies (106).The Dovetail Omni-C library sequences were aligned to the draft input assembly using bwa (91).The separations of Dovetail Omni-C read pairs mapped within draft scaffolds were analyzed by HiRise to produce a likelihood model for genomic distance between read pairs, and the model was used to identify and break putative misjoins, to 13 score prospective joins, and make joins above a threshold.Although the same data are used to estimate observed and predicted frequencies, the strong association detected in not a given and would not arise under all models of evolution (e.g., those with directional evolution rather than fluctuations due to NFDS).Panel (A) shows the general relationship between the mean frequency in a time series and the equilibrium predicted for directional selection favoring a morph or negative frequency-dependent selection.The relationship for the observed data is shown in (B) with points colored to denote locations as in Figure 1 of the main text (Pearson correlation = 0.99, P < 0.0001).T. podura T. cristinae Raw proportions from these counts are also reported.Lastly, climate data for each location, which are not analyzed in the current manuscript, are provided for a subsite of sites/years.
via the rstan (version 2.21.8) interface with Stan(101).We used the NUTS sampler(102) and ran four HMC chains each comprising a 1000 iteration warmup and 1000 additional sampling iterations.We again computed and examined the Gelman-Rubin convergence diagnostic to verify adequate HMC mixing and likely convergence of the HMC algorithm to the posterior distribution (all estimates of the potential scale reduction factors were 1.00).Effective sample sizes for core model parameters ranged from 455 to 2279 (median = 737, mean = 855).Reference genome DNA sequencing and assembly for the green Timema cristinae morph As noted in the main text, a single female stick insect from the locality Paradise Road North (PRCN, latitude 34.53 • N, longitude 119.85 • W, collected on Ceanothus spinosus in 2019) was used for the assembly.The DNA sample was quantified using Qubit 2.0 Fluorometer (Life Technologies, Carlsbad, CA, USA).The PacBio SMRTbell library (∼20kb) for PacBio Sequel was constructed using SMRTbell Express Template Prep Kit 2.0 (PacBio, Menlo Park, CA, USA) using the manufacturer recommended protocol.The library was bound to polymerase using the Sequel II Binding Kit 2.0 (PacBio) and loaded onto PacBio Sequel II.Sequencing was performed on PacBio Sequel SMRT cells.An initial genome assembly was then performed using the FALCON 1.8.8 pipeline from Pacific Bioscience.The assembly was polished through PacBio's Arrow algorithm from SMRT Link 5.0.1, using the original raw-reads.

Figure S2 :
FigureS2: Replicate time series reveal a close match between observed and predicted equilibrium morph frequencies under negative frequency-dependent selection.Although the same data are used to estimate observed and predicted frequencies, the strong association detected in not a given and would not arise under all models of evolution (e.g., those with directional evolution rather than fluctuations due to NFDS).Panel (A) shows the general relationship between the mean frequency in a time series and the equilibrium predicted for directional selection favoring a morph or negative frequency-dependent selection.The relationship for the observed data is shown in (B) with points colored to denote locations as in Figure1of the main text (Pearson correlation = 0.99, P < 0.0001).

Figure S3 :Figure S4 :Figure S5 :Figure S6 :FigureFigure S8 :
FigureS3: Predicted equilibrium stripe frequency as a function of the prevalence of Adenostoma at a locality.The scatterplot shows the percentage of Adenostoma (versus Ceanothus) and the predicted stripe frequency from the NFDS time series for each of the 10 focal sites.The best fit line from linear regression, which explained 93% of the variation in equilibrium stripe frequency (P < 0.0001), is shown.

Figure S9 :
Figure S9: Heatmap shows the proportion synteny blocks on of each T. podura scaffold that aligned to each of the T. cristinae chromosomes.Results are shown for the striped morph.

Table S2 :
Bayesian estimates of the slope parameters relating the change in stripe frequency to time.Point estimates and 95% equal-tail probability intervals (ETPIs) for each of the 10 focal locations are shown.

Table S3 :
Comparison of linear and sigmoid functions for relative fitness in the negative frequency-dependent selection (NFDS) experiment based on approximate leave-one-out (LOO) cross validation.Estimates and standard errors (in parentheses) are shown for the Bayesian LOO estimate of the expected log pointwise predictive density (ELPD LOO ), the effective number of parameters (P LOO ), and the LOO information criterion (LOOIC).Smaller values of LOOIC denote better models in terms of expected predictive performance.

Table S4 :
Comparison of mean absolute stripe frequency change and null expectations with drift and negative frequency-dependent selection (NFDS) assuming an effective population size of 110.Results are shown for each of 10 locations.Mean absolute change, the ratio of mean change to the mean null expectation (X-fold change) and the P -value from the null hypothesis test that the observed change did not exceed expectations for drift and NFDS are given.